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In their letter to PNAS and a comprehensive set of notes on arXiv [Hll], Christian Robert, Kerrie Mengersen 
and Carla Chen (RMC) represent our approach to model criticism in situations when the likelihood cannot be 
computed as a way to "contrast several models with each other" . In addition, guided by an analysis of scalar error 
terms on simple examples, RMC argue that model assessment with Approximate Bayesian Computation under 
model uncertainty (ABC/i) is unduly challenging and question its Bayesian foundations. We thank RMC for their 
interest and their detailed comments on our work, which give us an opportunity to clarify the construction of ABC/U 
and to explain further the utility of ABC/i for the purpose of model criticism. Here, we provide a comprehensive 
set of answers to RMC's comments, which go beyond our short response [3,. For sake of clarity, we re-state RMC's 
main points in italic before we answer each of them in turn. 

We wish to emphasize that the use of multiple error terms s^.j^ is a necessary and integral part of ABC/i. In 
the first section in |4], we introduced ABC/i with the number of error terms set to if = 1 to keep the presentation 
simple. In retrospect, we hope that this initial simplification did not lead to confusion (although in later sections 
and in our applications we clearly use multiple error terms). 

Introduction and notation. Approximate Bayesian Computation (ABC) exploits model simulations a; of a 
data-generating process M for sampling from approximate posterior distributions of the model parameters 9 [5]. 
Typically, such predictions form the basis for model criticism [6 , and we propose to use the data already generated 
by Monte Carlo implementations of ABC for this purpose too 4 . In ABC/i, the dual use of the model predictions 
is reflected in an extension of the state space of the targeted random variables: whenever the simulated summaries 
S(x) = {Si{x), . . . , Sk{x)} , X ~ f{-\9M) are sufficiently close to the observed summaries S(a:o), we retain not 
only but also the computed discrepancies. The rationale of ABC/i is that small discrepancies between x and the 
observed data xq indicate favorable 9, whereas if these discrepancies are always large, the data-generating process 



(in short: model) M cannot describe the observed data well. The full potential of ABC/i is realized when we 
compute multiple discrepancies, each for one summary statistic Sk, pk{Sk{x), Sk{xn)) . From first principles, we 
derived in fi| the sampling density of the accepted pairs 



^9, i^pk{Sk{x), Sk{xo))^ ^ ^ 
which we denote by 

We obtained a formula for the "augmented likelihood" fa;o,e(£i k); which enables us to relate the posterior error 
density 



fp,r{ei.,K\xo,M) = J fp^r{d, 6-^,^1x0, M) de 

to the prior predictive error density, a well-known Bayesian quantity that was systematically discussed in a seminal 
paper [7] by Box (when K — 1 and p(S(a;), S(xo)) — x — xo)- We have 

fpAe,,j,\xo,M)^TT,^ Je,,j,\M) x Lp{e,.,j,\M) (2) 

where the prior predictive error density is given by 

Lp{ei.,K\M)= I d{{pk{Sk{x), Sk{xa)) = ek)^.j^} Tr{x\M) dx, 



and tt{x\AI) — J f{x\9, M)Trg{6\M)d9 denotes the prior predictive (data) density. The shorthand S notation 
represents a limit of functions as detailed in Section SI. 1 of the PNAS Supplementary Material The density 
TTj is fully determined by the ABC kernel in the likelihood approximation, 



fp^ri9\xo,M) = J fp,ri9,e-^.j^\xo,M) de^.j^ 

^Tre{0\M) J Tr,^,^(^(^pk{Sk{x),Sk{xo))) \ f{x\9,M) dx, 

and can be interpreted as a prior density [5] . The relationship Eq.[2]enables us to associate a statistical interpretation 
to our posterior errors and to relate them to other Bayesian quantities. 



Standard Assumptions in ABC and A'BCp. We assume that (Al) TTe^ ^ factorizes into Ofe^i'^'efci 
centered at zero and only depends on a multi-dimensional scale parameter r — (ri, . . . , ta'). The main reason 
behind (Al) is that otherwise, the same aspects of the data might be used to adjust the ABC kernel (or "prior" 
density) tt^^ as well as the magnitude of the errors Pk{Sk{x), Sk{xo)) , and hence (potentially) more than once 
to inform our quantities of interest fp,r{9\xo, M) and fp^risi-Kl^Oi M). Furthermore, ABC/i might suggest to 
falsely reject the hypothesis that a model is an adequate representation of the data if tt^^ ^ is not centered at 
zero. Typical choices are TTeJefelM) 1/Tfcl{|£fc| < Tk/2}, T:s^{ek\M) = (27rr|)-i/2 exp ( - 1/2 e'^/rl) or 
7rjj.(ei;|M) — l/rfccxp ( — 2|efe|/Ti:). We emphasize that in ABC and ABC/i, (A2) the scale parameter r of the 
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prior TTe^ ^ is in general chosen as small as possible. Otherwise, if all model simulations are "acceptable", we have 
that 

fp^ri0\xo,M) cx TTgi0\M) J TT,^^^ (^[pk{Skix) , Sk{xo))) | f{x\e,M) dx 
= ng{9\M) J const x f{x\e,M)dx 

Furthermore, (A3) the compound function x p(S{x) ,S{xo)^ must be sensitive to changes in 6. Otherwise, we 
obtain 



fp,r{d\xo,M) oc J TT, (const) f{x\e,M) dxTTg{e\M) 

= Ms\M). 

The idea is to construct useful discrepancies which reflect changes in the simulated data as 9 changes. (A4) As in 
ABC, we require that pk{Sk{x), Sk{xo)) = if and only if Sk{x) — Sk{xo). In contrast to most implementations 
of ABC, these discrepancies should be real-valued rather than non-negative. For example, in the case of scalar 
summaries, we use pk{Sk{x), Sk{xo)) = Sk{x) - Sk{xo) instead of pk{Sk{x), Sk{xo)) = \Sk{x) - Sk{xo)\ g]. We 
seek to construct (A5) roughly symmetric predictive error densities Lp(e i.j^\AI) with mode at zero under the null 
hypothesis that the prior model is an adequate representation of the data. Otherwise, negative small errors Sk < Tk 
may be significantly more (or less) frequent than positive small errors Sk < Tk under the null, and conditioning 
on error magnitude could result in a large negative (or positive) posterior mean error even if the prior model is 
correct. Finally, we assume (A6) that the cumulative density function 

Pe,.o(ei efi,...,^^^^^) = J^t[[pk{Sk{x),Sk{xo)) e £k) fix\9, M)dx 

is either continuously differentiable when the observation space X is continuous, or a step function when X is finite. 
In this case, £,xo.e{£i-K) '^^^ be re-written in terms of its elementary derivative. Next, in order to derive Eqns.[2|3j 
we also assume that the data-generating process M given by /( • \9, M) is sufficiently regular to exchange the order 
of integration and limits; recall Section SI. 1 of the PNAS Supplementary Material [1]. 



Construction of ABC^u 

1. RMC point out that "the denomination [of ^^afii^i-K) '^^ likelihood is debatable" and that "the product 
fa:o.6'(£i K)'''ei.^ (^i-if) probabilistically incoherent" This conclusion derives from at least two observa- 
tions: (i) "ixo,e is strictly speaking not proportional to a density in xq" ]^ and (ii) "^xo.9{£i-K)''^£i.K^^'i--K) 
is not invariant under reparameterization" 

- In ABC, the observed data is reduced to a set of summary statistics and compared to simulated summaries 
with a positive, scalar- valued discrepancy function p(§(x),S(a;o)). For the purpose of parameter inference. 
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we only need to plug e = /9(S(x), S(xo)) into the ABC kernel. In other words, the scalar, positive error e is 
in ABC merely a latent random variable, introduced to facilitate Bayesian computation p, [TU]. 

In |4], we derive the sampling distribution of the random variable e, and recognize the utility of the related 
multiple error terms e^.j^, each associated to one summary, for the purpose of model criticism. To us, e^.j^ 
is a random variable of particular statistical interest and not any longer a latent variable introduced for 
computational reasons. Intuitively, we shift the observed summaries by £1.^ propose to infer whether 
summaries of xq that are shifted away from zero would occur at a higher frequency and hence be more 
probable than the (unshifted) observed summaries. Formally, we define and identify the probability density 

= 5h((^Pk{Sk{x),Sk{xa))-ek)^^ f{x\e,M)dx, (4) 

where the 5h function is given in Section 1.1 of the PNAS Supplementary Material [4 . For any given xq and 
'?e,£co (^i-if) is the infinitesimal frequency with which we observe the multi-dimensional error e^.^^. As RMC 
remark insightfully, it can be called a predictive error density that conditions on the observed data and the 
model parameter 9. In [4l, we termed 

^ /p,r(a;o|6',£i:K) = ^e,xo{ei:K) (5) 
an "augmented likelihood" simply to indicate that the state space was extended. 



Example 1 Suppose we observe a single, one- dimensional data point xq, and let us believe it is Poisson 
distributed with rate 9 (denoted by Mi). Consider the scalar error e = x — xq. By construction, we have 

£,e,xo{^) — lim / Sii(x~xo — e) Poisson(a;; 61) = rr 1 {xq + £ G [0, 00)}, 

and the right hand side equals in e a Poisson distribution shifted by —Xq and in Xq a Poisson distribution 
shifted by —e. Thus, when interpreted as a function in xq, ^e.xoi^) defined for negative values. 

RMC's illuminating Poisson example serves to demonstrate how ^g^^oi^) differs from a "likelihood". However, 
RMC go beyond our construction Eq.^and truncate xq — s- S,e.xo{£) to positive values so as to re-adjust $,0^xo{^) 
to the likelihood /(xo|0, Mi) that is only defined for positive Xq |I| I^'- To be clear, this re-adjustment is not 
part of ABC PL. 



Eq. [4] corresponds to a non-parametric evaluation of the sampling model in the context of model uncertainty. 
We adhere to the sampling model in that data is simulated under the likelihood, x ^ f{^\^^ ^^'^ probe 
the model predictions in several directions at the same time. If e^.^ = 0, we have with (A4) that £,6,xo{^i-k) 
corresponds to the probability of the observed summaries under 9. For error terms different from zero, we 
quantify the probability of deviations from the observed summaries under the sampling model. Labeling 
fp,T{xo\(),£i K) Eq-[5] a "shifted likelihood" seems therefore more appropriate. Because we only shift the 
observed summaries in Eq.|4] (with no further re-adjustments towards the original likelihood as in [1] E]), 
the re-normalization required when considering Eq.|5] as a function in xg does not depend on e^.^^ , and 
fp,Tixo\9,£^.j^) is proportional to a density in xq. 
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Next, let us recall that our error terms Sk correspond directly to the compound functions x ^ pk {Sk{x), Skixn)) 
Therefore, in ABC^, a transformation of implies a change in how the data is being summarized. Typ- 
ically, such a change requires to modify the scale parameter r of the prior density tt^^ ^ when the scale of 
the discrepancies changes too. Therefore, transformations of the product ^xo,e{^)'^e{s) niust also change r in 
TTp when the Jacobian is not constant. 

In the ABC literature, it is well-known that the approximate posterior density fp^r{0\xo, M) depends on the 
choice of discrepancies and the stringency of t [5J [TTJ [T^]. Since ABC/i only uses the information provided 
in ABC to a fuller extent, the joint posterior density /p,t(^, ^i-k\^0: M) is equally sensitive to changes in the 
compound functions x — s- pk{Sk{x), Sk{xo)) and the vector r. In other words, the ABC and ABCp target 
densities fp^T{0\xo, M) and fp.rid, e^.j^Ixq, M) are not invariant under different approximation schemes. This 
leaves ABC/Lt probabilistically sound, but warrants particular caution and calls for sensitivity analyses, perhaps 
to a larger extent than is common practice. 



Model assessment 



2. Model assessment with ABCfj, requires that "the data is informative" and "is challenging" JJl [^. In the 
location-family example it is shown that the posterior error equals the prior error if the prior predictive 
density is flat. 

- We agree with RMC that ABC/i cannot criticize a model when the observed data xq reduces to a single, 
one-dimensional data point, as in their examples [2] on page 1-2. More generally, we showed that the 
posterior error can be interpreted as a weighted prior predictive error, Eq.[2][4. If the prior predictive error 
is uninformative, then the posterior error will simply reflect the weighting. 

Example 2 Suppose a Gaussian likelihood model M2 with unknown mean 6 and fixed variance 1, and a 
Gaussian prior density T:g{9\M2) — M{0', d* , h^). We have 

7r{x\M2) = J N{x\e,i)N{e-e\h^) de 

^JV{x;e\h^ + 1), 

and Lp^r{£\M2) = Af{£; 0* — xq, h'^ + 1). We mimic a situation where ng is uniform by letting h — > 00, so that 
Tr{x\Al2) and Lp^(e|M2) become improper. Suppose further a Gaussian error density Tr^{£\M2) = Af{£', 0, r^). 
Then, /p,,(e|xo,M2) = AA(£;0,r2). 

Likewise, when models have comparable parameter spaces, then the B ayes' factor will be indecisive under non- 
informative priors irg 11 3 1. Gonsider the alternative Gaussian model MI2 defined by f{x\9,M2) = M{x;6,3), 
the same prior density iTg, and let us focus on the approximate B ayes' factor 

^"'^ ^ fpltim = ^ fpAxo\0,M^)ne{e\M^)de^/(^l fp^A^o\0,M2)7re{m2)d9^ 
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to mimic the situation that we cannot readily evaluate the likelihood. We obtain 



^0 



"'^ Vr2 + /t2 + 3 ^\^(r2 + /i2 + l)(r2 + /i2 + 3) 
which tends rapidly to one as h ^ oo. 

In this setting, both our posterior error and approximate Bayes' factor give reasonable answers for the purpose 
of model criticism and model comparison respectively. Based on one data point, we cannot reject the current 
model M2 and likewise, the approximate Bayes ' factor for choosing among AI2 and a comparable model is 
indecisive. 



Clearly, there is no guarantee that ABC/i always uncovers existing model mismatch. But how difficult is 
it to uncover existing discrepancies with ABC/^ in practice? Typically, xq contains some structure and/or 
repeated observations. Instead of using just one data point in Example[2] let us imagine a data set of 100 
samples and summarize this data with two statistics, leading to a two-dimensional posterior error density. 



Example 3 Consider a data set xq of 100 independent samples that are Exponentially distributed with rate 
l/fj-t — 0.2. We believe that each sample is generated from J\f( ■ ;9,l) and consider a Gaussian prior density 
7re{9\M2) — M{9; 9* , h^). We summarize the data with the sample mean xq and the sample median median(a;), 
use the discrepancies S'fe(a;o)) — Sk{x) — Sk{xQ) and consider the prior density ^(e^.j^\M2) = 

Hfe l/Tfcexp ( - 2\ek\/Tk) with Tk = 0.1. 

To illustrate that ABCfi may reveal inappropriate prior specifications, we set 9* = and = 0.1. We applied 
the Metropolis-Hastings sampler proposed by Marjoram et al. \14^ and recorded the computed discrepancies 



to estimate our posterior error (mcmcABCfi see page 11). A more detailed discussion of various algorithms 



to sample from the ABC^i target density Eq^ will appear elsewhere. Figura^^-B show that the marginal 
densities fp.T{£k\xo,M2) are far from zero, suggesting that our strong prior beliefs are inadequate to explain 
the data. 

To illustrate that ABC^l may identify a faulty sampling model, we set 9* = 5, h'^ = 100000. Again, we 
estimated the ABCfi target density numerically with mcmcABCfi. Even though -Kg is essentially fiat, our 
marginal posterior errors do not center at zero, see Ftgurcf^-D. 

In J^, we investigated primarily the marginal posterior densities fp.risklxo, M) = J /p.T-(e]^.^|a;o, M) ds-k. 
Here, we also show heat plots which reflect more comprehensively the multi-dimensional character of our error 
density /p,r (^sr, £modian|a;o, A/2) in Figure^L-B. 



Intuitively, ABC^ will indicate model mismatch whenever all discrepancies are simultaneously not close to 
zero for any 9. To escape unidentifiability, the crux in ExamplejS] is to use multiple error terms associated 
to co-dependent summaries that may reveal model inconsistencies, see also [1] for a similar example. In 
real-world applications, (most) summary statistics are usually co-dependent, rendering ABC/i a potentially 
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Figure 1: Numerical reconstructions of the densities fp^r{£x\xo,M2), /p.r (smcdian | a^o j -^^2) in Example|3] obtained 
with samples generated by mcmcABC/i. The (A-B) posterior error under inappropriate prior specifications tts and 
the (C-D) posterior error under essentially flat prior specifications ng suggest model mismatch. 
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Figure 2: Heat plots of the density /p,r(£s, £mcdian|a:^Oj -^^2) in Example|3] obtained with samples generated by 
mcmcABC^, (A) under inappropriate prior specifications irg and (B) under essentially flat prior specifications ng. 

very powerful method to reveal model inconsistencies. Because model inconsistencies can only increase with 
the inclusion of new summary statistics, we are typically prepared to use a large set of summaries. Moreover, 
it is not required that these co-dependent summaries are sufficient for 9 under the data-generating process 
M, as we illustrate in Figure 1. A more detailed discussion will appear elsewhere; here we only note that 
these properties are appealing because in real-world applications of ABC and ABC/i, it is typically not known 
whether any set of summaries is sufficient for the parameters of a given model while it is relatively easy to 
come up with co-dependent summaries. 

However, the extent to which fp^r{£i-K\^oM) merely reflects the weighting tt^^ ^ should be checked, because 
the discrepancies might not retain enough information of the data to question a model (recall A3). The 
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perhaps simplest (but not necessarily successful) approach is to compare the posterior error density to the 
shape of TTe^^. Reassuringly, in our real-world applications, fp^risi-Kl^o-^) differs markedly from t^s^_j^] see 
e.g. Figure 3 in the PNAS paper where the prior error density is indicated in dotted lines. Crucially, since 
ABC/i does not reject a model when fp ris^.j^lxoM) is close to tt^^ ^ (recall Al), no harm is done should the 
discrepancies not be informative. 

The power of ABC/x in assessing goodness-of-fit stems, firstly, from probing a model in multiple directions 
at the same time. We hope that our simple examples illuminate the contribution of model inconsistencies, 
as reflected in multiple error terms, to model assessment. Secondly, ABC/i makes possible to criticize a 
model whose likelihood cannot be readily evaluated, and does not incur any extra computational cost when 
compared to ABC (in contrast to related predictive approaches discussed in point 4 below). 



Model criticism 



3. ABC^"is strongly impacted by prior modeling [and] fails to condition on the observed data" f^. 

- The ABC kernel, which can be interpreted as a prior density tt^^ ^ [8 , is at the heart of ABC (recall Al) and 
modulates the degree to which ABC and ABC/i condition on the observed data. In other words, the ABC and 
ABC/t target densities are sensitive to the choice of tt^^ ^ and particularly its scale parameter r. Indeed, ABC/t 
conditions on the observed data by accepting 9 in relation to the magnitudes of the K computed discrepancies 
taken together. Accordingly, under (A2, A3), the posterior error density fp^ri^i-Kl^Oi M) updates the prior 
predictive error density Lp{ei.j^\M); see Example|4] below. Based on the observation that small error boosts 
the weight of the associated value of 9 that are simulated from tt^, we say that "ABC/i criticizes a fitted 
model" . This can be illustrated with the location family in Example|2] 

Example 4 Consider again the Gaussian likelihood model M2 and a Gaussian prior density ng as in Exam- 
ple\^ For our illustration purposes, let us choose no broad but not flat: h" = 9. In this case, 

fp,rie\xQ,M2) cx AA(e; 9* - xq, 10)AA(e; 0, r^) 

where 9 = [t^ / {t^ + 10)] x [9* — xq) and = 10t^/{t^ + 10) < 10. The posterior error density "up- 
dates" the prior predictive error in that the variance of fp^ri^lxo, M2) is smaller than the one of Lp^r{£\M2). 
Furthermore, we observe that |^| is smaller than the absolute mean of Lp .^{e\M2) , reflecting the fact that 
fp^ri^lxQ, M2) criticizes a fitted model rather than the prior model (M2,Tre). 

For the purpose of model criticism, it is important to recognize that the dependency of our posterior error 
on TTe^ ^ is a good thing to the extent to which the prior irg is not an adequate model parameterization. 
The smaller r can be chosen, the more we are able to criticize a fitted model and the more we attenuate 
the influence of ire in fp,Ti£i-K\^OT M). The latter point can also be illustrated with the location family in 
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Figure 3: Heat plots of our posterior error density /p,r (sx, emedian|a;o, M^) in Examplejsjfor broad ttq (A) fia = 1000, 
(B) ao — 2 and f3o — tq — 1000 and when r is set too large, (C) ao = 2 and Po = To = 1000 and — 6.4. 

Example[2] recall Section SI. 2 in the PNAS Supplementary Material. Let us illustrate the influence of ng and 
TTg^ ^ when the summaries are co-dependent but not sufficient for 6 under model M. 



Example 5 Consider again the data set xq o/lOO independent samples that are Exponentially distributed with 
rate 0.2, suppose now that each sample is generated according to a Gaussian likelihood model with unknown 
mean /i G M and > (denoted by M^) and summarize the data with the sample mean and median. We 
consider pk{Sk{x), Sk{xQ)) = Sk{x) ~ Sk{xo), T^e^^^ (si./f IM3) = DaLi V'^fc 1 { kfc| < Tfc/2}, and the prior 
density TTe{9\M3) = 7r(^|Af3)7r(cr2|M3) where 

7T{a^\M3) ^IG{a^; ao,Po) 
Tr{^i\M3) cx 1 - ^o| < To}. 

In J^, Figure 1, we chose a slightly different prior density ng with hyperparameters fiQ = 5. Tq = 10. ao = 4 
and (3q — 75 such that the prior means of fi and are 5 and (3q/ {a^ — l) = 25, matching the empirical mean 
and the standard deviation of the observed data. 

To illustrate that ABCj-i uncovers existing model mismatch with co-dependent summaries that are not suffi- 
cient for 9 under M3 even when irg differs markedly from the data or is uninformative, we now vary these 



hyperparameters. First, let us set tq = 1000. We ran mcmcABCfi (see page 11) for 10,000 iterations to sam- 
ple from fp.ri^J-j o-^, Sx, £median\xo, M3) with Tfc sct to 1.6, and repeated this run four times from overdispersed 
starting values to assess the convergence of the chains. Samples from the burn-in period were discarded. 
Figur^^^ illustrates that our joint posterior error density remains virtually unchanged (compare to Figure IC 
in ]EI). Next, we set — 2 and (3o = tq = 1000 and ran mcmcABCfi as above. Even though ng is now 
extremely broad, our joint posterior error density continues to identify model mismatch; see Figure^^. 

ABC depends on the error threshold t, and so does ABCfi. In order to identify model mismatch with 
fp^ri^i-xl^o, M), existing conflicts among several summary statistics are uncovered by setting r sufficiently 
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small. For example, setting to 6.4 such that the acceptance probability of mcmcABCji is larger than 80%, 
the posterior error is very broad and does not suggest model mismatch; see Figure[^. 



In summary, the ability of ABC/i to criticize a fitted model is strongly modulated by the choice of discrepancies 
and the error threshold t. Probing a model under particular assumptions on ng in the directions specified by 
£-^.^ is not guaranteed to uncover existing model mismatch. For example, using ABC/i with the sample mean 
and the standard deviation (two independent summaries) in place of the sample mean and the median in 
Examplejs] fails to uncover existing model mismatch. Similarly, using the sample mean and the 25% quantile 
fail to reveal model inconsistencies as clearly as the sample mean and the sample median. In principle, the 
contribution of "the data" to /p,T(£i if I^^Oj M) can only increase with larger K and/or more stringent choices 
of T, and cannot be quantified by considering fiat tt^^^ (see also A2). 

In the directions determined by 6]^.^ and r, the criticized model comprises the sampling model M and our 
prior assumptions irg, and we acknowledge that "having no way to distinguish between prior and sampling 
model inadequacy is a difficulty" [2]. More work is needed here. 

4. From an ABC perspective, using the posterior predictive m{x\xQ, M) instead of the prior predictive 7r(a;|A/) 
"requires same computing times" J^. 

- In the context of ABC when the likelihood cannot be readily evaluated, the use of the posterior predictive 
(data) density 



is complicated by the fact that samples from the true posterior density f(9\xo, M) are in general not available. 
However, m(x\xo, M) can be approximated by 



An alternative approach for model criticism could be the approximate posterior predictive (APP) error density 



However, for a complex model the extra volatility induced by simulating from f{-\9,M) means that re- 
simulations from fp^Ti9\xo, M) need not meet the stringency requirement tt^^^. . It might therefore also be 
useful to consider the weighted approximate posterior predictive (wAPP) error density 

fp,T,Xo{£l:K\xO,M) CX Lp^r,Xo{ei:K\M)TT{ei.^K\M). 

Both Lp ^.^j.„(ej.^|M) and fp.r.xoi^i-Kl^OiM) adopt a sequential approach to model criticism, comprising 
a training step (inference of fp,T-{9\xo, M)) and a testing step (APP or wAPP). The testing step adds a 
computational overhead to typical ABC procedures. For example, it takes about two minutes to evaluate our 
seven summaries on the Saccharomyces cerevisiae PPI data set |15] . and hence an extra 2 x 500/60 > 16hrs 





Lp,t,xo{£i:k\^I) = / ^] [Pk{Sk{x),Sk{xo)) = Ek) \ mp^r{x\xo,M) dx 
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to obtain 500 samples from APP on one computer. Assuming a large acceptance probability of 10%, the 
extra time required to obtain 500 samples from wAPP is more than 6 days. By contrast, our posterior error 
incurs no additional computational cost because the discrepancies already computed in any ABC algorithm 
are only used to a fuller extent. Nevertheless, one might be prepared to pay this cost if the densities 
Lp,T.xoi^i-K\^^) fp.T.xoi^i-Kl^OjM) would have an intrinsic advantage compared to our posterior error 
density /p,^(e;^.^|a;o, Af ). 

In general, it is difficult to compare the behavior of our posterior error with APP and wAPP under model 
uncertainty. First, we note that Lp T-.xoi^i-K\^^) '^^^ fp.ri^i-Kl^Oj are very different quantities, 
relating respectively to sequential and simultaneous approaches to model criticism. This is also reffected 
in their distinct asymptotic properties as r ^ 0. Second, ExamplejT] in the Appendix demonstrates that, 
counter-intuitively, fp ,,-(0\xo,M) may be broader than 7rg under model uncertainty. 

An additional complication to be considered with APP and wAPP is that the same aspects of the data are used 
to inform both the training and the testing phase. Hence, these quantities violate the fundamental requirement 
in statistical learning that the training data be independent from the testing data [T^. It is possible to use 
different aspects of the data during both stages, and this brings us back to the partially predictive and 
conditionally predictive densities previously discussed by Bayarri and Berger |17j . Unfortunately, in real- 
world applications of ABC, it is often difficult to identify discrepancies that are independent of each other. 

5. Our estimator ^ to the augmented likelihood that is based on B repeat samples "cannot be used as a practical 
device because B is necessarily small" J7j/ . . in which case the non-parametric approximation is poor, or B 
is large in which case producing the x's is too time-consuming" JBj- 

- In our applications, we found that we obtained largest improvements in terms of the effective sampling size 
for small to moderate values of B that are computationally feasible. Let us also recall that the choice of 
proposal kernel in s^.j^ is a crucial element of the second algorithm in j4j and should not be omitted when 
considering its efficiency. We acknowledge that our observations may not readily extend to other applications. 

In the same way that we augmented standard ABC to what we call Std-ABC/Lt in it is straightforward to 
modify any existing ABC algorithm for the purpose of model criticism by using (i) many co-dependent, real- 
valued discrepancies and (ii) recording those discrepancies. For example, the Metropolis-Hastings sampler 
proposed by Marjoram et al. [14 can be adapted to provide samples from the target distribution 

ip,r{dO,dx,d£^.ji\xa,M) = — - — — (dp^,^(^)[de^.ji) dx dO) , 

Jp,t\Xq\M) 

where we put Pi:k{x) — (pk{Sk{x), Skixo))) for brevity. Here, 6p-^_^.(^x){dsi.j() denotes the Dirac measure 
at the point Pi:k{x)- Suppose initial values 9^, x^ ^ f{ - \9^, M) and set e\ = pk{Sk{x'^), Sk{xo)). 

mcmcABC/il If now at 6 propose a move to 9' according to a proposal density q{9 9'). 
mcmcABC/i2 Generate a;' ~ /( • \9' , M) and compute ej^ — pk[Sk{x'), Sk{xo)) for k — 1, ... , K. 
mcmcABC/i3 Accept {9' ,x' ,e'i.j^) with probability 

mh{9,x,ej^.ji]9\x\e[.j() = min{l , rvaniiia(6', a;, £1.;^; 61', x', ei.^)} 
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where 



''vanilla(^7 2;, Ej^.jf 6 , X , ^I-.k) 



1:K 



q{s ^ Q') 



1:K 



and otherwise stay at (0, x, Ej^.^). Then return to nicmcABC/il. 

As is standard practice, the algorithm is run sufficiently long after a certain "burn-in" period, and samples 
from the burn-in period are discarded. It is not difficult to show that marginally in (0,£-^.^), mcmcABC/i 
provides samples from /p_T-(0, ej^.j^jxo, A/) Eq.[l]for suitable proposal kernels q{9 — s- 6') under our regularity 
assumptions (A6). 

Model criticism and model comparison 

6. "The Bayesian foundations of ABCfi are questionable: the consequences of rejecting a model are ignored by 
ABCji but include constructing another model" ^11 and "this leads to wonder about the gain compared with 
using the B ayes factor" JEj. Moreover, "the estimation of Bayes' factors is even faster" '^11 and "provides a 
different answer" 12]. 

- To us, model criticism and model comparison are important and complementary aspects of statistical 
reasoning. Indeed, methods for model comparison attempt to choose between candidate models, even if all 
of them do not match the data in one or several aspects well. 

ABC is very flexible in that arbitrary data-generating processes M can be analyzed without the need to 
compute the likelihood, so long as the evaluation of the summary statistics is computationally tractable. 
ABC/i makes possible to evaluate at no extra computational cost whether a model matches the observed 
data in terms of a large set of summary statistics, and to obtain useful indications how a faulty model should 
be modified. In our work, we found that ABC/Lt thus enables to iterate rapidly through the initial stages 
of model design to identify one or a suite of models which are in agreement with the data, even when the 
likelihood cannot be readily evaluated. We believe that the ability of ABC/x to offer statistical rigor at this 
point is highly valuable to areas of modern science where complex models are now formulated to explain and 
agree with data collected across the traditional boundaries of disciplines. For example, in biology, we face 
a wealth of data that is hard to analyze in its entirety under current computer resources (e.g. molecular 
genetic data), or we have one intricate data set (e.g. molecular interaction networks), or we cross boundaries 
of biological organization (e.g. systems biology). 

The methods presented in Ratmann et al. 4 do not address the problem of choosing a model from a suite 
of candidates. Model comparison when the likelihood cannot be readily evaluated is not the topic of [4 , and 
has been introduced elsewhere [18 1 [191 l20 l [2T | [22] . 

Example 6 Let us re-visit RMC's Poisson example fllW in order to (a) illustrate model criticism with ABC'ii 
when the errors are discrete rather than continuous random variables, (b) inspect the case of asymmetric 
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Figure 4: Density plots of fp^ri£\Mi) in Examplejo] Posterior mean errors are indicated in large diamonds. (A) 
We Gx T — 2 and consider a data point xq = 1 that is in agreement with our prior belief ng — Exp(l) as well as a 
data point xq = 5 that differs from our prior model. In the latter case, the posterior mean error suggests mismatch 
between the model and the data. (B) We fix Xg — 5 and consider the prior predictive error density (corresponding 
to an essentially flat tt^ with r = 50) and the posterior error density associated to r = 2/3. Again, we observe that 
this fitted model is harder to criticize than the prior model. 



predictive error densities Lp{ei.jf\M) and (c) re-examine the behavior of the approximate marginal likelihood 
as presented in [2j, Figure 1. 

Consider the Poisson model Mi of Exampl^^ and suppose that T:g{e\Mi) = exp(-e') 1 {61 > O}. We have that 
7r(x|Mi) = / Poisson{x\9)exY>{-e)d9 ^ 2""^"^ Ijx > 0}, 



and hence Lp(e|M2) oc 2^^^°^^^^ l{a;o + £ > 0}. The ABC kernel always depends on an "error threshold" r 
(recall Al ) and, given the form of Lp{e\M2), we consider here tt^ : {0, 1, —1, 2, . . . } — + M.^ with 7re(e|Mi) (x 
2-|e|/r^ QHj. marginal posterior error is 

fp,r{£\Mi) oc 2"(^«+^+l^l/^+i) 1{xq + e > 0}, 
with normalizing constant 

00 

e=-xo 

'l _ 2(i-i/^)(2:o + i) 



2-^'"°+^') (^t{xo > 0} 



1 - 21-1/t 



(6) 



1 _ 2-1-1/' 
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Figure 5: (A) Plots of the posterior mean error / efp^r{£\xo, Mi) in Example|6] as a function of for r = 2/3 (red), 
T = 2 (black) and r = oo (grey). Provided the prior model is an adequate representation of the data (xo = 1), the 
prior predictive mean error is zero (dashed lines). By contrast, the posterior mean error is not zero in this case 
(red and black lines), simply because Lp{e\Mi) is not symmetric. (B) Plots of the ABC/i approximate marginal 
likelihood as a function of for r = 2/3 (red), r = 2 (black) and r = oo (grey). Note that this plot differs 
qualitatively from the one in [2] because RMC decided to truncate £,e,xa to positive values. 

under the assumption that Xq > 0. Figur^^ illustrates the posterior error density for various choices of 
Xq and T, and the respective posterior means are indicated in large diamonds. Note that tt^ was here only 
chosen for reasons of analytical tractahility, and we could still use our two-sided Exponential density tt^ : M — > 
where ■Ks{e\M) — 1/Texp( — 2|e|/T), or the standard indicator function. Indeed, even if we do not 
know the set of possible discrete errors under a model M , all we miss is the correct normalizing constant 
of Tie'. {0,1,-1,2,...} — !■ where ■K^{e\M) oc exp ( — 2|£|/r) . This constant need not be known, see for 
example our algorithm mcmcABCfi. 

Figur^^ illustrates the posterior mean error J e/p_T-(e|Afi) e as a function of xq for various choices of t. 
Setting t = oo, we obtain the prior predictive mean error, which is zero if the model corresponds well to 
the observed data (xq — I). Since Lp(s\M2) is not symmetric around zero when the prior model is adequate 
(opposing A5), conditioning on error magnitude results in a slightly negative posterior mean error when 
Xq = 1. 

Let us recall that RMC decided to truncate the density e — > ^o.xai^e) to non-negative values, and then plotted 
the associated marginal likelihood f trunc{xQ\M\) — JJ ^g''™'^(£)7r0(0|Mi)7re(£|Afi) dd de as a function of xq in 
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l^, Figure 1. In Figur^^, we plot the ABC^l marginal likelihood Eq^as a function of xa. In this example, 
the approximate marginal likelihood decreases monotonically for all values of t , and only small values of t 
provide a suitable approximation of the true marginal likelihood (t — Q). Thus, the posterior mean error and 
the approximate marginal likelihood both depend on the precise value of the "error threshold" , suggesting that 
sensitivity analyses are required for ABC^ as well as for complementary tools for model comparison that are 
based on approximate marginal likelihoods. 

The Bayes' factor is a tool that may address both model comparison and model criticism, depending on the 
formulation of the null and alternative hypothesis. It is possible to devise approximate Bayes' factors to test 
the null hypothesis £ = versus the alternative e 7^ as a surrogate measure for the hypothesis that the 
model describes the data adequately well, i.e. for the purpose of model criticism (unpublished results, but 
see [131 [23]). However, in both cases, the robustness of the Bayes' factor (with regard to the choice of r and 
the quality of the numerical approximation of the ABC or ABC/x target densities) is debatable (unpublished 
results). While we agree that computing the Bayes' factor proposed in [201 dS] is faster than computing 
posterior predictive checks, we also note that it cannot be faster than sampling from fp.T{0,£i.ii\xo,M) so 
long as the same ABC kernel (i.e. tt^^ ^) is used. 

7. "Comparing models via the posterior error is missing the model complexity penalisation from Bayesian model 
comparison" JI]/. 

- We acknowledge that model complexity is an important quantity to consider during model comparison. 

8. "The choice of e and i^ei^) model dependent and the comparison [of models] reflects prior modeling, not 
data assessment" lij. Finally, "using the same r across all models does not seem to be recommendable on a 
general basis" /TJ. 

- We agree that the choice of discrepancies (hence errors) and t are application- and model specific. Although 
the same summaries can typically be used across models that attempt to explain the same data, model 
predictions will typically vary and hence the scales of the simulated summaries. This implies that the same 
T may not always be used across different models. In this case, it may be difficult to compare the posterior 
error density fp^r{£\xo,M) across different models. In [4 , we only suggest to use fp^T{£\xo,M) to compare 
each model against the observed data. 

Conclusion 

We still find that ABC/x enables us to comprehensively quantify discrepancies between a data-generating process 
M and the data, simultaneously with parameter inference even when the likelihood cannot be readily evaluated, 
thus providing valuable guidance on the interpretability of parameter estimates and on how to improve models. 

However, the method has its limitations. The posterior error reffects an interplay between the prior predictive 
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error and the stringency with which that error is updated; recall Eq.[2j At present, there is no formal procedure 
to disentangle the contribution of tt^^^ and Lp^e^.j^lM) in fp^ri^i-Kl^o, M); this prompted us to caution that it is 
difficult to convincingly associate a formal, probabilistic framework with credibility intervals of fp,Ti£k\xo, M) [3]. 
In other words, there is no formal guarantee that zero is included in a 95% credibility interval with a probability 
of 0.95 under the hypothesis that the prior model is correct. We agree that the methods proposed by Verdinelli 
and Wasserman |24j are promising for the purpose of model criticism via Bayes' factors, although the sharp hull 
hypothesis ej^.^j- = has limitations in itself [25]. Moreover, we emphasize that fp,T{£i-K\^o, M) cannot be thought of 
as a purely Bayesian quantity because tt^^ ^ also determines the approximation quality of fp^ri&lxo, M), recall Eq.jsj 
In particular, this implies that the contribution of "the data" to fp^ri^i-Kl^OT M) cannot be directly quantified by 
setting TTg^ ^ uniform. Finally, we agree with RMC that the ABC and ABC/i target densities, i.e. fp^riSlxa, M) and 
/p,t(^, ei-xl-'^Oj E^re sensitive to changes in the compound functions x pk[Sk{x), Sk{xo)^ (i.e. not invariant) 
and may attain different meanings under different choices of r. This leaves the whole method probabilistically 
coherent, but calls for sensitivity analyses. 

Nonetheless, we believe that ABC and ABC/i are useful to compare observed data and model simulations in a 
coherent way and to make inference on the model parameters as well as the error terms. It is difficult to understand 
posterior quantities of fp T-{0\xo,M) in place of posterior quantities of the true posterior density f{6\xo,M), but 
it makes good sense to interpret them as quantities that lie between the prior and posterior density. With our 
inability to evaluate the likelihood f{xo\9, M), we acknowledge that it is too cumbersome (or would take too long) 
to comprehend the data in full, and turn to those aspects Sk of the data which we consider to be most relevant. 
Doing so, we retain some information of the available data and our posterior density fp^rif^lxo, M) updates our prior 
beliefs accordingly. Simultaneously, we can make use of the very same information to investigate the adequacy of a 
model M in explaining the data, and to update our prior predictions according to error magnitude. In conclusion, 
if we interpret fp^r{0\xo, M) as an update of our prior beliefs in 9 and fp,r{£i.j^\xo,M) as an update of our prior 
predictive density under M, then both are useful and meaningful quantities, particularly when the likelihood 
f{6\xo,AI) cannot be evaluated. 
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Appendix 

Example 7 Reconsider the data set xq of n = 100 independent samples that are Exponentially distributed with 
rate 1/fit = 0.2. We believe again that each sample of xo is generated from ^{^,0"^) with /i G M, cr^ > unknown, 
and take a conjugate normal inverse-gamma prior density for ji and a"^ with hyperparameters /Iq G M, rtp = 1, 
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Oq > and /?□ > 

nia^lMs) - //G(ao,/3o)(^') = PS° {cT-^"^' exp ( - Po'y-') / r(ao) 

'^^.^(ei:^ 1^3) = n {|£fc| < rfe/2}, 

wif/i hyperparameters set to /io = 5, rig = 1. uq = 4 and /?o = 75. We ran Std-ABCfi based on the summary 
SYMM(x) — X — median{x), p(§(a;), S(xo)) = SYMM{x) — SYMM{xq) and the above conjugate prior in 9 for 
20,000 iterations to obtain samples from the joint posterior density /p,t(a*, c^, SsymmIxq, M3) for various values of 
T. Interestingly, the approximate posterior density fp^-ri^lxo, M) broadens for decreasing values of t, as shown in 
Figure!^ A-B). 




Figure 6: Numerical estimates of (A) /p,r(/^|a;o: -^3) and (B) /p,r(f^|a;o, -^3) in Example|7] for decreasing values of 
TSYMM (different colors). The respective marginal prior densities are overlaid (black, dashed). 
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